Refactor connection handling #1060

Merged
merged 12 commits into from
Sep 19, 2023

Conversation

paolobarbolini
Contributor

@paolobarbolini paolobarbolini commented Jul 28, 2023

This is an attempt at refactoring Connection and ConnectionHandler to use poll instead of async .await syntax.

Advantages:

  1. Ability to continue reading while in the middle of writing: previously a write operation would block read operations from happening. Now as soon as an operation returns Poll::Pending the other operations can be polled
  2. Ability to continue reading and writing while in the middle of flushing: previously flushing blocked everything else until it completed
  3. Easier control of connection state and reduced error branches because of centralized connection error handling
  4. Removal of select! from a very critical path. The issues with it are the following:
    • It relies on the Futures polled by its branches being cancel safe. Cancel safety requires dropping a Future to be a no-op. While this seemed to be true here, it's easy to get wrong.
    • Execution of the expression blocks all other operations (see point 1 and 2)
  5. No more flush interval
  6. Higher write performance thanks to flattening of small writes (buffering them ourselves) and vectored writes when possible
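
To illustrate points 1 and 2, here is a minimal sketch of a poll-based handler in which a pending write does not prevent reads from being served in the same pass. `MockIo`, `Handler`, `poll_once`, and `run_demo` are hypothetical names made up for this illustration and are not the PR's actual `Connection` or `ConnectionHandler` types; the socket is replaced by scripted poll results so the sketch runs with only the standard library.

```rust
use std::collections::VecDeque;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for a socket: each direction replays a script of
/// poll results, so no real I/O is needed.
struct MockIo {
    reads: VecDeque<Poll<&'static str>>,
    writes: VecDeque<Poll<()>>,
}

/// Hypothetical handler, not the PR's `ConnectionHandler`.
struct Handler {
    io: MockIo,
    outbox: VecDeque<&'static str>,
    received: Vec<&'static str>,
    sent: Vec<&'static str>,
}

impl Handler {
    /// One pass of the state machine. Reads and writes are polled
    /// independently, so a `Pending` write does not hold back reads.
    /// A real implementation would register `cx`'s waker with the socket;
    /// the mock ignores it.
    fn poll_once(&mut self, _cx: &mut Context<'_>) -> Poll<()> {
        // Read side: take everything that is ready right now.
        while let Some(Poll::Ready(msg)) = self.io.reads.pop_front() {
            self.received.push(msg);
        }
        // Write side: send queued commands until the socket pushes back.
        while let Some(cmd) = self.outbox.front().copied() {
            match self.io.writes.pop_front() {
                Some(Poll::Ready(())) => {
                    self.sent.push(cmd);
                    self.outbox.pop_front();
                }
                // Not writable (or script exhausted): retry on the next pass.
                _ => break,
            }
        }
        if self.outbox.is_empty() && self.io.reads.is_empty() {
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    }
}

/// Waker that does nothing; enough to build a `Context` for the demo.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Returns what was received during the first pass, how many commands had
/// been sent by then, and whether the second pass completed.
fn run_demo() -> (Vec<&'static str>, usize, bool) {
    let mut handler = Handler {
        io: MockIo {
            reads: VecDeque::from([Poll::Ready("MSG a"), Poll::Ready("MSG b")]),
            // The first write attempt is not ready, the second one is.
            writes: VecDeque::from([Poll::Pending, Poll::Ready(())]),
        },
        outbox: VecDeque::from(["PUB x"]),
        received: Vec::new(),
        sent: Vec::new(),
    };
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);

    let first = handler.poll_once(&mut cx);
    assert!(first.is_pending());
    let received_first = handler.received.clone();
    let sent_first = handler.sent.len();

    let second = handler.poll_once(&mut cx);
    (received_first, sent_first, second.is_ready())
}
```

In the first pass both incoming messages are delivered even though the write returned `Poll::Pending`; the queued command goes out on the second pass.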

Left to implement:

  • Flush interval
  • Vectored writes
  • Disconnect handling
  • Write backpressure
  • Reimplement flush
  • Cleanup commits
  • Benchmarks

Fixes #905
Fixes #923
Fixes #582
Closes #869
Closes #935
Closes #971
Closes #1070

@paolobarbolini paolobarbolini changed the title Refactor connection handling into a state machine Refactor connection handling Jul 29, 2023
@paolobarbolini
Contributor Author

paolobarbolini commented Jul 29, 2023

Benchmarks as of now:

main
async-nats: publish throughput/32
                        time:   [202.71 µs 208.83 µs 215.22 µs]
                        thrpt:  [14.180 MiB/s 14.614 MiB/s 15.055 MiB/s]
async-nats: publish throughput/1024
                        time:   [331.07 µs 343.49 µs 356.15 µs]
                        thrpt:  [274.20 MiB/s 284.30 MiB/s 294.97 MiB/s]
Found 1 outliers among 30 measurements (3.33%)
  1 (3.33%) high mild
async-nats: publish throughput/8192
                        time:   [2.2131 ms 2.3912 ms 2.5751 ms]
                        thrpt:  [303.39 MiB/s 326.72 MiB/s 353.01 MiB/s]

async-nats: publish messages amount/32
                        time:   [200.31 µs 205.57 µs 211.80 µs]
                        thrpt:  [472.13 Kelem/s 486.45 Kelem/s 499.22 Kelem/s]
Found 2 outliers among 30 measurements (6.67%)
  2 (6.67%) high mild
async-nats: publish messages amount/1024
                        time:   [325.23 µs 335.40 µs 346.18 µs]
                        thrpt:  [288.86 Kelem/s 298.15 Kelem/s 307.48 Kelem/s]
Found 2 outliers among 30 measurements (6.67%)
  1 (3.33%) high mild
  1 (3.33%) high severe
async-nats: publish messages amount/8192
                        time:   [1.9831 ms 2.2042 ms 2.4116 ms]
                        thrpt:  [41.466 Kelem/s 45.367 Kelem/s 50.427 Kelem/s]

subscribe amount/32     time:   [1.5607 ms 1.6374 ms 1.6978 ms]
                        thrpt:  [58.900 Kelem/s 61.074 Kelem/s 64.075 Kelem/s]
subscribe amount/1024   time:   [1.1902 ms 1.2250 ms 1.2595 ms]
                        thrpt:  [79.400 Kelem/s 81.633 Kelem/s 84.018 Kelem/s]
subscribe amount/8192   time:   [4.2955 ms 4.3576 ms 4.4296 ms]
                        thrpt:  [22.575 Kelem/s 22.949 Kelem/s 23.280 Kelem/s]
Found 2 outliers among 30 measurements (6.67%)
  2 (6.67%) high severe
this branch against main
async-nats: publish throughput/32
                        time:   [190.19 µs 196.71 µs 203.55 µs]
                        thrpt:  [14.992 MiB/s 15.514 MiB/s 16.045 MiB/s]
                 change:
                        time:   [-8.7393% -4.7959% -0.8225%] (p = 0.03 < 0.05)
                        thrpt:  [+0.8294% +5.0375% +9.5762%]
                        Change within noise threshold.
Found 1 outliers among 30 measurements (3.33%)
  1 (3.33%) high mild
async-nats: publish throughput/1024
                        time:   [308.62 µs 318.89 µs 329.49 µs]
                        thrpt:  [296.38 MiB/s 306.23 MiB/s 316.43 MiB/s]
                 change:
                        time:   [-8.6554% -2.5471% +3.8930%] (p = 0.46 > 0.05)
                        thrpt:  [-3.7471% +2.6137% +9.4756%]
                        No change in performance detected.
Found 2 outliers among 30 measurements (6.67%)
  1 (3.33%) high mild
  1 (3.33%) high severe
async-nats: publish throughput/8192
                        time:   [1.4773 ms 1.5160 ms 1.5551 ms]
                        thrpt:  [502.36 MiB/s 515.34 MiB/s 528.85 MiB/s]
                 change:
                        time:   [-37.950% -31.428% -24.041%] (p = 0.00 < 0.05)
                        thrpt:  [+31.650% +45.833% +61.161%]
                        Performance has improved.
Found 3 outliers among 30 measurements (10.00%)
  2 (6.67%) high mild
  1 (3.33%) high severe

async-nats: publish messages amount/32
                        time:   [192.45 µs 197.46 µs 202.29 µs]
                        thrpt:  [494.35 Kelem/s 506.44 Kelem/s 519.61 Kelem/s]
                 change:
                        time:   [-11.902% -7.2112% -2.1045%] (p = 0.01 < 0.05)
                        thrpt:  [+2.1497% +7.7717% +13.510%]
                        Performance has improved.
Found 1 outliers among 30 measurements (3.33%)
  1 (3.33%) high mild
async-nats: publish messages amount/1024
                        time:   [309.51 µs 318.95 µs 330.52 µs]
                        thrpt:  [302.55 Kelem/s 313.53 Kelem/s 323.09 Kelem/s]
                 change:
                        time:   [-10.980% -3.2048% +4.6338%] (p = 0.45 > 0.05)
                        thrpt:  [-4.4286% +3.3109% +12.334%]
                        No change in performance detected.
Found 3 outliers among 30 measurements (10.00%)
  1 (3.33%) high mild
  2 (6.67%) high severe
async-nats: publish messages amount/8192
                        time:   [1.4486 ms 1.4860 ms 1.5305 ms]
                        thrpt:  [65.339 Kelem/s 67.293 Kelem/s 69.030 Kelem/s]
                 change:
                        time:   [-31.125% -23.573% -15.210%] (p = 0.00 < 0.05)
                        thrpt:  [+17.938% +30.843% +45.191%]
                        Performance has improved.
Found 3 outliers among 30 measurements (10.00%)
  1 (3.33%) high mild
  2 (6.67%) high severe

subscribe amount/32     time:   [4.3207 ms 4.4703 ms 4.6061 ms]
                        thrpt:  [21.710 Kelem/s 22.370 Kelem/s 23.144 Kelem/s]
                 change:
                        time:   [+173.31% +198.66% +227.86%] (p = 0.00 < 0.05)
                        thrpt:  [-69.499% -66.517% -63.412%]
                        Performance has regressed.
Found 3 outliers among 30 measurements (10.00%)
  1 (3.33%) low mild
  2 (6.67%) high mild
subscribe amount/1024   time:   [1.3827 ms 1.4205 ms 1.4670 ms]
                        thrpt:  [68.166 Kelem/s 70.398 Kelem/s 72.321 Kelem/s]
                 change:
                        time:   [+14.539% +18.584% +23.016%] (p = 0.00 < 0.05)
                        thrpt:  [-18.710% -15.672% -12.694%]
                        Performance has regressed.
Found 2 outliers among 30 measurements (6.67%)
  2 (6.67%) high mild
subscribe amount/8192   time:   [4.2687 ms 4.3939 ms 4.5255 ms]
                        thrpt:  [22.097 Kelem/s 22.759 Kelem/s 23.426 Kelem/s]
                 change:
                        time:   [-3.3419% +3.2701% +9.7699%] (p = 0.35 > 0.05)
                        thrpt:  [-8.9003% -3.1665% +3.4574%]
                        No change in performance detected.
Found 2 outliers among 30 measurements (6.67%)
  2 (6.67%) high mild

I'll look into what's happening with subscriptions

@paolobarbolini
Contributor Author

I tried implementing vectored writes to see how hard it would be. It took very little. The results are awesome. I'll send them once this is merged though 😃


@paolobarbolini paolobarbolini force-pushed the connection-into-state-machine branch 2 times, most recently from c2ceea6 to 712ef12 Compare July 30, 2023 18:52
@paolobarbolini
Contributor Author

paolobarbolini commented Jul 30, 2023

There's a problem I didn't take into account before: calling Client::flush() now means waiting both for past and future writes to be flushed, while previously it only acted on past writes. This makes calling flush a footgun in some cases.

I'm kind of feeling like we could live without it. When it comes to SQL databases for example I don't think I ever needed to flush the connection manually, so why should that apply for NATS? The real missing thing in nats.rs is the ability to wait for acknowledgments from the Core NATS protocol. Right now calling subscribe for example resolves as soon as the Command was written to the channel. What should instead happen is having it wait for the server to reply +OK. I think I can implement that in a follow-up PR.

@paolobarbolini paolobarbolini force-pushed the connection-into-state-machine branch 2 times, most recently from 53ce952 to 0d9c518 Compare July 31, 2023 09:34
@caspervonb
Collaborator

caspervonb commented Jul 31, 2023

What should instead happen is having it wait for the server to reply +OK

+OK is only when verbose mode is on, and it isn't tied to any particular request 🤔

@paolobarbolini
Copy link
Contributor Author

What should instead happen is having it wait for the server to reply +OK

+OK is only when verbose mode is on, and it isn't tied to any particular request 🤔

Ah. I wasn't expecting that 😄. So best case scenario we know when a command was written to the buffer. Nothing more 😞

@abalmos

abalmos commented Jul 31, 2023

I'm kind of feeling like we could live without it. When it comes to SQL databases for example I don't think I ever needed to flush the connection manually, so why should that apply for NATS?

There are definitely use cases where minimum latency is vastly more important than throughput. flush is important then.

@paolobarbolini
Contributor Author

I'm kind of feeling like we could live without it. When it comes to SQL databases for example I don't think I ever needed to flush the connection manually, so why should that apply for NATS?

There are definitely use cases where minimum latency is vastly more important than throughput. flush is important then.

hyper doesn't have flush and everything still runs smoothly.

@abalmos

abalmos commented Jul 31, 2023

hyper doesn't have flush and everything still runs smoothly.

How does an HTTP library compare to NATS?

If you look back into issues on this repo, you will find people at odds with async's built-in buffering and using flush() to get around it. NATS probably shouldn't ignore the extreme “real-time” in real-time messaging (like sensor readings in a control system). There will always be latency in things, but sometimes you seek to eliminate what you can. For example, if you are only sending 100 msg/s from a small embedded device, and you care a great deal about latency, then buffering is exactly the wrong answer.

Breaking flush as you have described (waiting on future things) I would consider as a breaking change, and one that ought to really be reconsidered.

@Jarema
Member

Jarema commented Jul 31, 2023

There's a problem I didn't take into account before: calling Client::flush() now means waiting both for past and future writes to be flushed, while previously it only acted on past writes. This makes calling flush a footgun in some cases.

I'm kind of feeling like we could live without it. When it comes to SQL databases for example I don't think I ever needed to flush the connection manually, so why should that apply for NATS? The real missing thing in nats.rs is the ability to wait for acknowledgments from the Core NATS protocol. Right now calling subscribe for example resolves as soon as the Command was written to the channel. What should instead happen is having it wait for the server to reply +OK. I think I can implement that in a follow-up PR.

As @caspervonb mentioned, you get +OK only in verbose mode, and we never use it.

NATS is pretty specific because you might be publishing just a few bytes and want to reach subscribers as soon as possible - with latency almost equal to the latency between clients and the server. Very low latency is a critical aspect of NATS that we can't compromise. In many NATS use cases, every millisecond counts.

That balance between as-high-as-possible throughput and as-low-as-possible latency will probably be impossible with just plain and simple non-flushing behaviour. Maybe async I/O will make it simpler in the future (and for example allow not buffering at all?), but still we probably should not compromise older systems.

That's why we were thinking about getting rid of BufWriter and doing our own implementation, where we can flush with more control.

@paolobarbolini
Contributor Author

paolobarbolini commented Jul 31, 2023

I agree that NATS should optimize for real time while also having the option to be lazier in its flushing, allowing fewer but bigger network packets to be sent. The reason why I said manual calls to flush felt to me like something we could live without is that I feel like other libraries optimize for these things without needing to tell the user to manually flush.

Currently a 1ms flush interval is configured by default. This means that compared to an implementation which flushes after every write, in the worst case you get 1ms of extra write latency (except that you're not really getting 1ms; keep reading). I don't like the timer approach, but I feel like this is not a bad default.

For the extreme realtime cases I feel like no timer should be present at all, and instead flushing should happen as soon as at least one command is ready to be sent. Asking the user to manually flush is a giant workaround. There are a bunch of libraries that can autonomously flush without having the user do it manually. The implementation already does write pipelining. Manual flushing actually keeps the library user from publishing faster. It doesn't make things faster unless they tokio::spawn the flush call and the library is able to intermix write and flush calls (which it can't do without this PR).
As of now the library is actually slower in that case, because while you flush you can't do anything else, even if you tokio::spawn the call onto a separate Tokio green thread, so there goes your realtime dream.

With that said let me explain what this PR really breaks: this PR does not break or remove flushing, it actually gives the entire write path the chance to max out what the TCP connection is capable of (excluding vectored writes which is another very nice improvement which will come later). The only thing it breaks is that if you manually call .flush().await you might wait more than it actually takes to write your messages. This is because now writing and flushing happen concurrently, so the flush operation doesn't end once your messages finished writing but after all buffered messages have. Let's draw out a scenario using simple numbers:

Before

  • write 1000 bytes
  • manually call .flush().await
  • call flush. kernel flushes 800 bytes to the network. Can't flush the remaining 200 right now. Poll::Pending
  • wait for the flush waker. Now the connection is stuck and can't do anything else until flushing has completed. You can't read, even though that would be a sort-of independent operation, and you can't write more stuff into the intermediate buffers
  • we get woken up and are able to write the remaining 200 bytes
  • The .flush() Future resolves and the application knows the 1000 bytes have been sent to the network
  • another write for 2000 bytes comes in and the cycle repeats

After

  • write 1000 bytes
  • manually call .flush().await
  • call flush. kernel flushes 800 bytes to the network. Can't flush the remaining 200 right now. Poll::Pending
  • because we are handling things concurrently, instead of waiting for flush to be ready again we do some more work
  • we write 2000 bytes from a command enqueued by a different Tokio green thread
  • we call flush again and it's now able to write 400 bytes, but because it wasn't able to empty the write buffers (we gave it 2000 more bytes in the previous step) it returns Poll::Pending again (even though it was able to go forward with some flushing work)
  • we go back to doing something else while waiting for flush to be ready again
  • at some point flush is able to empty the write buffers so it returns Poll::Ready and the .flush() Future resolves

As you can see the actual write performance increased, because while the kernel was flushing we were able to add more data to the available portion of its buffers (or other userspace buffers we have going on). Plus let's not forget we were able to read this whole time. This potentially allows us to send bigger network packets. The only problem is that because .flush() only ever returns Poll::Ready once all buffers have been emptied, the fact that we are now able to give it more work (while staying within the buffer's limits) means it'll be harder for it to finish. This wouldn't be a problem if it weren't for the fact that we're asking flush whether it has finished or not, and if the network isn't fast enough it'll tell you it hasn't, so now your .flush() call takes longer to resolve.
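
The Before and After walkthroughs above can be replayed with a tiny model that counts how many flush polls it takes for the buffer to become empty. This is only a sketch: `polls_until_flushed` is a hypothetical helper, and the 1800 bytes accepted on the third poll of the "after" case is an assumed number chosen so the scenario terminates (the walkthrough only says it empties "at some point").

```rust
/// Counts how many flush polls are needed before the buffer is empty.
/// `drained[i]` is how many bytes the kernel accepts on poll `i`;
/// `written[i]` is how many bytes other tasks append while poll `i` is
/// pending (always empty in the "before" model, where writes are blocked).
/// Returns `None` if the buffer is still not empty when the script ends.
fn polls_until_flushed(initial: usize, drained: &[usize], written: &[usize]) -> Option<usize> {
    let mut buffered = initial;
    for (i, &accepted) in drained.iter().enumerate() {
        buffered -= accepted.min(buffered);
        if buffered == 0 {
            return Some(i + 1); // flush returns Poll::Ready
        }
        // flush returned Poll::Pending; concurrent writers may add more data.
        buffered += written.get(i).copied().unwrap_or(0);
    }
    None
}

/// Before: nothing else can be written while the flush is in progress.
fn before() -> Option<usize> {
    polls_until_flushed(1000, &[800, 200], &[])
}

/// After: 2000 more bytes are enqueued while the first flush is pending.
/// The final 1800-byte drain is an assumption made for this sketch.
fn after() -> Option<usize> {
    polls_until_flushed(1000, &[800, 400, 1800], &[2000])
}
```

In this model the original 1000 bytes are on the wire after the second poll in both cases, but in the "after" case the caller of `.flush()` only hears about it on the third poll, once the extra 2000 bytes are gone too.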

@paolobarbolini
Contributor Author

paolobarbolini commented Jul 31, 2023

NATS is pretty specific because you might be publishing just a few bytes and want to reach subscribers as soon as possible - with latency almost equal to the latency between clients and the server. Very low latency is a critical aspect of NATS that we can't compromise. In many NATS use cases, every millisecond counts.

That balance between as-high-as-possible throughput and as-low-as-possible latency will probably be impossible with just plain and simple non-flushing behaviour. Maybe async I/O will make it simpler in the future (and for example allow not buffering at all?), but still we probably should not compromise older systems.

That's why we were thinking about getting rid of BufWriter and doing our own implementation, where we can flush with more control.

While a custom BufWriter might increase performance a bit it won't matter if flushing blocks reads and writes. What was done here with the VecDeque and vectored writes coming in the future could be a BufWriter replacement. It alone won't have improved latency though. The only reason why I did both in the same PR is that it would have been a pain to implement the second part in a different way.

The bottleneck is that you're not taking full advantage of what the current async tools are giving you. Reading and writing should happen concurrently between each other and with flushing, otherwise you'll have dead moments where no I/O is happening in userspace (and to some extent in the kernel). And whatever you do you can't live without flushing. Guaranteed when dealing with rustls or other TLS libraries.
The manual flush calls are a workaround. They could work if they didn't wait for flushing to actually finish, but still why have them if we trust the internals to already be as quick as possible.

@Jarema
Member

Jarema commented Aug 1, 2023

@paolobarbolini I just did a very simple test:

    #[tokio::test]
    async fn request_bench() {
        use futures::stream::StreamExt;

        let server = nats_server::run_basic_server();

        let (tx, rx) = tokio::sync::oneshot::channel();

        tokio::spawn({
            let url = server.client_url().clone();
            async move {
                let client = async_nats::connect(url).await.unwrap();
                let mut subscription = client.subscribe("request".into()).await.unwrap();
                client.flush().await.unwrap();
                tx.send(()).unwrap();

                while let Some(message) = subscription.next().await {
                    client
                        .publish(message.reply.unwrap(), "".into())
                        .await
                        .unwrap();
                    client.flush().await.unwrap();
                }
            }
        });

        let client = async_nats::connect(server.client_url()).await.unwrap();
        rx.await.unwrap();

        let total = std::time::Instant::now();
        for _ in 0..100 {
            let now = std::time::Instant::now();
            let request = client.request("request".into(), "".into()).await.unwrap();
            let elapsed = now.elapsed();
            println!("request took: {:?}", elapsed);
        }
        println!("took: {:?}", total.elapsed());
    }

and run it against your branch and main.

main

running 1 test
request took: 386.5µs
request took: 347.334µs
request took: 325.792µs
request took: 312.375µs
request took: 401.667µs
request took: 293.042µs
request took: 267.291µs
request took: 289.584µs
request took: 311.125µs
request took: 278.708µs
request took: 307.666µs
request took: 305.625µs
request took: 282.375µs
request took: 311.5µs
request took: 257.459µs
request took: 211.375µs
request took: 225.083µs
request took: 238.333µs
request took: 235.292µs
request took: 220.166µs
request took: 309.167µs
request took: 347.625µs
request took: 246.208µs
request took: 220.083µs
request took: 275.166µs
request took: 208.542µs
request took: 201.75µs
request took: 242.667µs
request took: 289.167µs
request took: 258.583µs
request took: 243.5µs
request took: 236.458µs
request took: 235.75µs
request took: 213.5µs
request took: 226.167µs
request took: 201.708µs
request took: 210.541µs
request took: 241.709µs
request took: 213.209µs
request took: 213.083µs
request took: 218.209µs
request took: 216.166µs
request took: 231.25µs
request took: 226.25µs
request took: 201.416µs
request took: 225.083µs
request took: 214.5µs
request took: 154.666µs
request took: 149.334µs
request took: 146.875µs
request took: 199.917µs
request took: 158.333µs
request took: 165.792µs
request took: 200.083µs
request took: 150.042µs
request took: 139.167µs
request took: 158µs
request took: 159.833µs
request took: 131.25µs
request took: 170.167µs
request took: 165.667µs
request took: 146.75µs
request took: 170.625µs
request took: 175.083µs
request took: 139.209µs
request took: 154.25µs
request took: 128.458µs
request took: 130.458µs
request took: 155.042µs
request took: 154.042µs
request took: 151.75µs
request took: 154µs
request took: 152.708µs
request took: 200.208µs
request took: 186.5µs
request took: 179.542µs
request took: 252.083µs
request took: 194.375µs
request took: 190.709µs
request took: 177.625µs
request took: 186.875µs
request took: 191.875µs
request took: 206.209µs
request took: 179.75µs
request took: 201.541µs
request took: 181.292µs
request took: 188.292µs
request took: 183.333µs
request took: 191.917µs
request took: 182.417µs
request took: 190.667µs
request took: 181.708µs
request took: 182.834µs
request took: 233.792µs
request took: 267.417µs
request took: 195.459µs
request took: 216.958µs
request took: 203.333µs
request took: 220.083µs
request took: 199.917µs
took: 22.05625ms

this PR

running 1 test
request took: 1.636792ms
request took: 2.731041ms
request took: 1.58475ms
request took: 1.566958ms
request took: 2.643208ms
request took: 1.454792ms
request took: 2.466917ms
request took: 1.493792ms
request took: 2.574417ms
request took: 1.496417ms
request took: 2.665875ms
request took: 1.534542ms
request took: 2.622291ms
request took: 2.609041ms
request took: 1.47ms
request took: 2.647458ms
request took: 1.451333ms
request took: 2.544708ms
request took: 2.532542ms
request took: 2.517083ms
request took: 2.507375ms
request took: 2.603041ms
request took: 2.570875ms
request took: 1.361167ms
request took: 2.481333ms
request took: 2.571875ms
request took: 2.548333ms
request took: 2.52175ms
request took: 2.557666ms
request took: 1.356833ms
request took: 2.445416ms
request took: 2.476167ms
request took: 2.543875ms
request took: 2.538417ms
request took: 2.460625ms
request took: 2.619833ms
request took: 2.529833ms
request took: 1.352584ms
request took: 2.525375ms
request took: 2.547ms
request took: 2.559417ms
request took: 1.445042ms
request took: 2.46875ms
request took: 1.389834ms
request took: 2.439042ms
request took: 2.509292ms
request took: 2.513542ms
request took: 2.538291ms
request took: 2.528375ms
request took: 2.490459ms
request took: 2.564542ms
request took: 1.363833ms
request took: 2.516792ms
request took: 2.544042ms
request took: 2.559834ms
request took: 1.405125ms
request took: 2.496167ms
request took: 2.517917ms
request took: 2.500666ms
request took: 2.473292ms
request took: 2.540166ms
request took: 2.5195ms
request took: 2.458042ms
request took: 2.512125ms
request took: 2.523209ms
request took: 1.33225ms
request took: 2.49025ms
request took: 2.469333ms
request took: 2.486875ms
request took: 2.546042ms
request took: 2.488375ms
request took: 2.522208ms
request took: 2.511334ms
request took: 2.491667ms
request took: 2.538208ms
request took: 2.559042ms
request took: 2.411417ms
request took: 2.527666ms
request took: 2.467333ms
request took: 2.443625ms
request took: 2.483ms
request took: 2.47725ms
request took: 2.533583ms
request took: 2.491792ms
request took: 2.444958ms
request took: 2.4395ms
request took: 2.453959ms
request took: 2.480542ms
request took: 2.507917ms
request took: 2.518084ms
request took: 2.484042ms
request took: 2.501208ms
request took: 2.491292ms
request took: 2.497958ms
request took: 2.494875ms
request took: 2.515584ms
request took: 2.519917ms
request took: 2.520375ms
request took: 2.465583ms
request took: 2.533583ms
took: 234.451083ms

Removing the flush from publish on response makes them both as slow as this PR code.
I deliberately used two client connections.

This is a huge difference.
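
To put a number on the difference, the two totals printed above can be turned into a mean per request and a ratio. This is just arithmetic on the reported totals (22.05625ms on main, 234.451083ms on this PR, 100 requests each); `mean_per_request` and `slowdown` are hypothetical helper names, and the totals also include the `println!` overhead, so the result is approximate.

```rust
use std::time::Duration;

/// Mean latency per request, given the total printed by the test.
/// Panics if `requests` is zero.
fn mean_per_request(total: Duration, requests: u32) -> Duration {
    total / requests
}

/// How many times slower the second run is compared to the first.
fn slowdown(main_total: Duration, pr_total: Duration) -> f64 {
    pr_total.as_secs_f64() / main_total.as_secs_f64()
}

/// Totals copied from the two runs above, in nanoseconds.
fn reported_totals() -> (Duration, Duration) {
    (
        Duration::from_nanos(22_056_250),  // main: 22.05625ms
        Duration::from_nanos(234_451_083), // this PR: 234.451083ms
    )
}
```

With these inputs the mean is roughly 220µs per request on main against roughly 2.34ms on this PR, a bit more than a 10x slowdown.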

So, we are well aware that there are optimizations possible to the flushing/writing mechanism, but this does not seem to be the way.

Also note that most, if not all, NATS clients have a manual flush option for exactly this reason.
Maybe we can be smarter, but we're not as of yet ;).

@paolobarbolini
Contributor Author

If you want to write at the highest possible speed, remove the timer:

diff --git a/async-nats/src/lib.rs b/async-nats/src/lib.rs
index d516697..1a3345d 100644
--- a/async-nats/src/lib.rs
+++ b/async-nats/src/lib.rs
@@ -460,7 +460,7 @@ impl ConnectionHandler {
                 }
 
                 if !self.handler.is_flushing && self.handler.connection.needs_flush() {
-                    self.handler.is_flushing = self.handler.flush_interval.poll_tick(cx).is_ready();
+                    self.handler.is_flushing = true;
                 }
 
                 if self.handler.is_flushing {

@Jarema
Member

Jarema commented Aug 2, 2023

Ah, you're right.

@paolobarbolini I did a quick check of how a flush-less approach could work here: #1070

@paolobarbolini
Contributor Author

paolobarbolini commented Aug 2, 2023

I liked the idea of #1070 of only flushing if there's nothing left to write. I've incorporated it as an experiment. It doesn't seem to make much of a difference with the current benchmarks.
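
The "only flush if there's nothing left to write" idea can be sketched as a small decision function. The names `Step`, `next_step`, and `flushes_for_burst` are hypothetical and the model is deliberately simplified (every write and flush completes immediately); it is not the code from #1070 or from this PR.

```rust
/// What the connection loop does next in this simplified model.
#[derive(Debug, PartialEq)]
enum Step {
    Write,
    Flush,
    Idle,
}

/// Prefer writing while commands are queued; flush only once the queue is
/// empty and some written bytes have not been flushed yet.
fn next_step(queued_commands: usize, unflushed_bytes: usize) -> Step {
    if queued_commands > 0 {
        Step::Write
    } else if unflushed_bytes > 0 {
        Step::Flush
    } else {
        Step::Idle
    }
}

/// Counts the flushes needed for a burst of commands that are all queued
/// before the loop starts.
fn flushes_for_burst(commands: usize, bytes_each: usize) -> usize {
    let (mut queued, mut unflushed, mut flushes) = (commands, 0usize, 0usize);
    loop {
        match next_step(queued, unflushed) {
            Step::Write => {
                queued -= 1;
                unflushed += bytes_each;
            }
            Step::Flush => {
                unflushed = 0;
                flushes += 1;
            }
            Step::Idle => return flushes,
        }
    }
}
```

Under this policy a burst of queued commands costs a single flush, while a lone command is still flushed right away instead of waiting for a timer tick.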

I've even tried removing BufWriter: the worst regressions were -10% throughput on the publish /32 benchmarks, while the subscribe /32 benchmark gained +18% throughput.

Updated benchmarks against main.

EDIT: redid the benchmarks with #1073 on both branches

benchmark
async-nats: publish throughput/32
                        time:   [174.82 µs 180.19 µs 185.82 µs]
                        thrpt:  [16.423 MiB/s 16.936 MiB/s 17.456 MiB/s]
                 change:
                        time:   [-15.278% -10.978% -5.4405%] (p = 0.00 < 0.05)
                        thrpt:  [+5.7535% +12.332% +18.032%]
                        Performance has improved.
Found 2 outliers among 30 measurements (6.67%)
  1 (3.33%) high mild
  1 (3.33%) high severe
async-nats: publish throughput/1024
                        time:   [234.28 µs 240.81 µs 249.56 µs]
                        thrpt:  [391.31 MiB/s 405.53 MiB/s 416.84 MiB/s]
                 change:
                        time:   [-30.315% -26.563% -23.007%] (p = 0.00 < 0.05)
                        thrpt:  [+29.882% +36.172% +43.503%]
                        Performance has improved.
Found 1 outliers among 30 measurements (3.33%)
  1 (3.33%) high mild
async-nats: publish throughput/8192
                        time:   [481.57 µs 492.94 µs 504.07 µs]
                        thrpt:  [1.5136 GiB/s 1.5477 GiB/s 1.5843 GiB/s]
                 change:
                        time:   [-76.946% -74.794% -72.318%] (p = 0.00 < 0.05)
                        thrpt:  [+261.24% +296.73% +333.76%]
                        Performance has improved.
Found 1 outliers among 30 measurements (3.33%)
  1 (3.33%) high mild

async-nats: publish messages amount/32
                        time:   [178.25 µs 183.56 µs 189.69 µs]
                        thrpt:  [527.17 Kelem/s 544.79 Kelem/s 561.01 Kelem/s]
                 change:
                        time:   [-12.759% -8.0142% -3.3658%] (p = 0.00 < 0.05)
                        thrpt:  [+3.4830% +8.7124% +14.625%]
                        Performance has improved.
async-nats: publish messages amount/1024
                        time:   [238.71 µs 248.44 µs 259.19 µs]
                        thrpt:  [385.81 Kelem/s 402.52 Kelem/s 418.91 Kelem/s]
                 change:
                        time:   [-28.389% -24.318% -20.255%] (p = 0.00 < 0.05)
                        thrpt:  [+25.400% +32.132% +39.643%]
                        Performance has improved.
Found 3 outliers among 30 measurements (10.00%)
  3 (10.00%) high mild
async-nats: publish messages amount/8192
                        time:   [479.35 µs 492.24 µs 504.25 µs]
                        thrpt:  [198.32 Kelem/s 203.15 Kelem/s 208.62 Kelem/s]
                 change:
                        time:   [-80.353% -78.560% -76.507%] (p = 0.00 < 0.05)
                        thrpt:  [+325.65% +366.41% +408.98%]
                        Performance has improved.
Found 1 outliers among 30 measurements (3.33%)
  1 (3.33%) high mild

subscribe amount/32     time:   [3.8229 ms 4.0299 ms 4.2031 ms]
                        thrpt:  [23.792 Kelem/s 24.814 Kelem/s 26.158 Kelem/s]
                 change:
                        time:   [+157.09% +178.71% +204.45%] (p = 0.00 < 0.05)
                        thrpt:  [-67.154% -64.121% -61.103%]
                        Performance has regressed.
subscribe amount/1024   time:   [1.4314 ms 1.4791 ms 1.5265 ms]
                        thrpt:  [65.510 Kelem/s 67.607 Kelem/s 69.859 Kelem/s]
                 change:
                        time:   [+9.3931% +16.298% +23.189%] (p = 0.00 < 0.05)
                        thrpt:  [-18.824% -14.014% -8.5866%]
                        Performance has regressed.
Benchmarking subscribe amount/8192: Warming up for 1.0000 s
Warning: Unable to complete 30 samples in 5.0s. You may wish to increase target time to 5.8s, enable flat sampling, or reduce sample count to 10.
subscribe amount/8192   time:   [9.5134 ms 12.433 ms 14.929 ms]
                        thrpt:  [6.6985 Kelem/s 8.0432 Kelem/s 10.511 Kelem/s]
                 change:
                        time:   [+59.441% +109.45% +162.47%] (p = 0.00 < 0.05)
                        thrpt:  [-61.901% -52.255% -37.281%]
                        Performance has regressed.

@caspervonb
Collaborator

I've even tried removing BufWriter and the worst regressions were -10% throughput on the publish /32 benchmarks, while the subscribe /32 benchmark gained +18% throughput.

Try it with a larger buffer; e.g. #971 has been showing a night-and-day difference for me.

@paolobarbolini
Contributor Author

paolobarbolini commented Aug 3, 2023

Making BufWriter use a 64k buffer instead of 8k made it about 10% better. What really changed everything was flattening small writes myself, which gave a 50% improvement. Now my other Ryzen 9 5900X machine almost reaches 24 Gbit/s on the 8192 publish benchmark.
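
To illustrate what "flattening small writes" could look like, here is a minimal sketch using only the standard library. It is not the code from this PR: `WriteQueue`, `FLATTEN_THRESHOLD` and the 4096-byte value are hypothetical names and numbers chosen for the example. The idea is that many small protocol frames get copied into one contiguous chunk so the socket sees a single write, while large payloads stay as separate chunks that could later be handed to a vectored write.

```rust
use std::collections::VecDeque;

/// Hypothetical threshold: writes shorter than this are candidates for merging.
const FLATTEN_THRESHOLD: usize = 4096;

/// Queue of pending output chunks (illustrative only, not the PR's type).
#[derive(Default)]
struct WriteQueue {
    chunks: VecDeque<Vec<u8>>,
}

impl WriteQueue {
    /// Queue `data`, appending it to the last chunk when the combined
    /// size still fits under the threshold.
    fn push(&mut self, data: &[u8]) {
        if data.len() < FLATTEN_THRESHOLD {
            if let Some(last) = self.chunks.back_mut() {
                if last.len() + data.len() <= FLATTEN_THRESHOLD {
                    last.extend_from_slice(data);
                    return;
                }
            }
        }
        // Large payloads, or small ones that no longer fit, start a new chunk.
        self.chunks.push_back(data.to_vec());
    }

    /// Number of separate writes the socket would see.
    fn pending_writes(&self) -> usize {
        self.chunks.len()
    }
}
```

With this layout, a burst of small publishes collapses into one chunk, and only payloads at or above the threshold pay for their own write.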

@paolobarbolini
Contributor Author

paolobarbolini commented Aug 5, 2023

Updated benchmarks

results
nats::publish_throughput/32
                        time:   [137.87 ms 173.47 ms 207.31 ms]
                        thrpt:  [73.604 MiB/s 87.961 MiB/s 110.67 MiB/s]
                 change:
                        time:   [-55.612% -45.766% -34.838%] (p = 0.00 < 0.05)
                        thrpt:  [+53.463% +84.386% +125.29%]
                        Performance has improved.
nats::publish_throughput/1024
                        time:   [233.25 ms 243.69 ms 253.17 ms]
                        thrpt:  [1.8835 GiB/s 1.9567 GiB/s 2.0443 GiB/s]
                 change:
                        time:   [-61.865% -59.174% -56.090%] (p = 0.00 < 0.05)
                        thrpt:  [+127.74% +144.94% +162.22%]
                        Performance has improved.
Benchmarking nats::publish_throughput/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 11.7s.
nats::publish_throughput/8192
                        time:   [1.2802 s 1.3386 s 1.3853 s]
                        thrpt:  [2.7538 GiB/s 2.8498 GiB/s 2.9797 GiB/s]
                 change:
                        time:   [-66.469% -64.501% -62.440%] (p = 0.00 < 0.05)
                        thrpt:  [+166.24% +181.70% +198.23%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild

nats::publish_amount/32 time:   [116.98 ms 144.57 ms 173.83 ms]
                        thrpt:  [2.8763 Melem/s 3.4585 Melem/s 4.2742 Melem/s]
                 change:
                        time:   [-63.112% -53.451% -42.762%] (p = 0.00 < 0.05)
                        thrpt:  [+74.708% +114.83% +171.09%]
                        Performance has improved.
nats::publish_amount/1024
                        time:   [238.44 ms 248.83 ms 258.94 ms]
                        thrpt:  [1.9309 Melem/s 2.0094 Melem/s 2.0970 Melem/s]
                 change:
                        time:   [-58.584% -55.421% -52.049%] (p = 0.00 < 0.05)
                        thrpt:  [+108.55% +124.32% +141.46%]
                        Performance has improved.
Benchmarking nats::publish_amount/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 12.6s.
nats::publish_amount/8192
                        time:   [1.2594 s 1.3284 s 1.3952 s]
                        thrpt:  [358.37 Kelem/s 376.39 Kelem/s 397.01 Kelem/s]
                 change:
                        time:   [-68.415% -66.515% -64.580%] (p = 0.00 < 0.05)
                        thrpt:  [+182.33% +198.64% +216.61%]
                        Performance has improved.

nats::subscribe_amount/32
                        time:   [282.06 ms 313.02 ms 342.80 ms]
                        thrpt:  [1.4586 Melem/s 1.5973 Melem/s 1.7727 Melem/s]
                 change:
                        time:   [-36.946% -27.946% -18.320%] (p = 0.00 < 0.05)
                        thrpt:  [+22.429% +38.785% +58.593%]
                        Performance has improved.
nats::subscribe_amount/1024
                        time:   [375.49 ms 401.15 ms 426.16 ms]
                        thrpt:  [1.1733 Melem/s 1.2464 Melem/s 1.3316 Melem/s]
                 change:
                        time:   [-31.891% -26.226% -20.834%] (p = 0.00 < 0.05)
                        thrpt:  [+26.317% +35.550% +46.823%]
                        Performance has improved.
Benchmarking nats::subscribe_amount/8192: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 16.1s.
nats::subscribe_amount/8192
                        time:   [1.5520 s 1.5832 s 1.6130 s]
                        thrpt:  [309.99 Kelem/s 315.82 Kelem/s 322.17 Kelem/s]
                 change:
                        time:   [-73.768% -73.069% -72.328%] (p = 0.00 < 0.05)
                        thrpt:  [+261.38% +271.31% +281.21%]
                        Performance has improved.

Benchmarking nats::request_amount/32: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 9.3s.
nats::request_amount/32 time:   [901.08 ms 913.14 ms 925.03 ms]
                        thrpt:  [10.810 Kelem/s 10.951 Kelem/s 11.098 Kelem/s]
                 change:
                        time:   [-4.7387% -2.3639% +0.2315%] (p = 0.10 > 0.05)
                        thrpt:  [-0.2309% +2.4211% +4.9745%]
                        No change in performance detected.
Benchmarking nats::request_amount/1024: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 9.4s.
nats::request_amount/1024
                        time:   [902.87 ms 932.18 ms 951.70 ms]
                        thrpt:  [10.507 Kelem/s 10.728 Kelem/s 11.076 Kelem/s]
                 change:
                        time:   [-3.5725% +0.1295% +3.4167%] (p = 0.96 > 0.05)
                        thrpt:  [-3.3038% -0.1293% +3.7048%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low severe
Benchmarking nats::request_amount/8192: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 9.5s.
nats::request_amount/8192
                        time:   [966.84 ms 977.62 ms 987.97 ms]
                        thrpt:  [10.122 Kelem/s 10.229 Kelem/s 10.343 Kelem/s]
                 change:
                        time:   [-12.615% -10.990% -9.3744%] (p = 0.00 < 0.05)
                        thrpt:  [+10.344% +12.347% +14.436%]
                        Performance has improved.

Benchmarking jetstream::sync_publish_throughput/32: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 38.2s.
jetstream::sync_publish_throughput/32
                        time:   [3.6891 s 3.7582 s 3.8207 s]
                        thrpt:  [408.96 KiB/s 415.76 KiB/s 423.54 KiB/s]
                 change:
                        time:   [-5.5278% -2.9984% -0.3372%] (p = 0.05 > 0.05)
                        thrpt:  [+0.3384% +3.0911% +5.8512%]
                        No change in performance detected.
Benchmarking jetstream::sync_publish_throughput/1024: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 39.0s.
jetstream::sync_publish_throughput/1024
                        time:   [3.9759 s 4.0613 s 4.1395 s]
                        thrpt:  [11.796 MiB/s 12.023 MiB/s 12.281 MiB/s]
                 change:
                        time:   [-4.0732% -1.9653% +0.0650%] (p = 0.10 > 0.05)
                        thrpt:  [-0.0649% +2.0047% +4.2461%]
                        No change in performance detected.
Benchmarking jetstream::sync_publish_throughput/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 47.3s.
jetstream::sync_publish_throughput/8192
                        time:   [4.8534 s 4.9060 s 4.9554 s]
                        thrpt:  [78.828 MiB/s 79.623 MiB/s 80.485 MiB/s]
                 change:
                        time:   [-16.545% -15.424% -14.377%] (p = 0.00 < 0.05)
                        thrpt:  [+16.791% +18.237% +19.825%]
                        Performance has improved.

Benchmarking jetstream sync publish messages amount/32: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 37.3s.
jetstream sync publish messages amount/32
                        time:   [3.7332 s 3.7915 s 3.8514 s]
                        thrpt:  [12.982 Kelem/s 13.187 Kelem/s 13.393 Kelem/s]
                 change:
                        time:   [-5.0490% -3.0413% -1.2430%] (p = 0.01 < 0.05)
                        thrpt:  [+1.2587% +3.1367% +5.3175%]
                        Performance has improved.
Benchmarking jetstream sync publish messages amount/1024: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 38.8s.
jetstream sync publish messages amount/1024
                        time:   [3.9098 s 3.9921 s 4.0747 s]
                        thrpt:  [12.271 Kelem/s 12.525 Kelem/s 12.788 Kelem/s]
                 change:
                        time:   [-5.3982% -3.3109% -1.0617%] (p = 0.02 < 0.05)
                        thrpt:  [+1.0731% +3.4242% +5.7062%]
                        Performance has improved.
Benchmarking jetstream sync publish messages amount/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 49.5s.
jetstream sync publish messages amount/8192
                        time:   [4.8642 s 4.9132 s 4.9615 s]
                        thrpt:  [10.078 Kelem/s 10.177 Kelem/s 10.279 Kelem/s]
                 change:
                        time:   [-16.648% -15.657% -14.723%] (p = 0.00 < 0.05)
                        thrpt:  [+17.265% +18.564% +19.973%]
                        Performance has improved.

Benchmarking jetstream async publish throughput/32: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 9.9s or enable flat sampling.
jetstream async publish throughput/32
                        time:   [182.78 ms 192.31 ms 204.40 ms]
                        thrpt:  [7.4653 MiB/s 7.9344 MiB/s 8.3484 MiB/s]
                 change:
                        time:   [-22.437% -7.9578% +9.3424%] (p = 0.37 > 0.05)
                        thrpt:  [-8.5441% +8.6458% +28.928%]
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high mild
Benchmarking jetstream async publish throughput/1024: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.5s or enable flat sampling.
jetstream async publish throughput/1024
                        time:   [175.23 ms 209.87 ms 250.79 ms]
                        thrpt:  [194.70 MiB/s 232.66 MiB/s 278.66 MiB/s]
                 change:
                        time:   [-25.733% -6.6874% +13.813%] (p = 0.54 > 0.05)
                        thrpt:  [-12.136% +7.1667% +34.650%]
                        No change in performance detected.
jetstream async publish throughput/8192
                        time:   [309.57 ms 371.39 ms 445.67 ms]
                        thrpt:  [876.50 MiB/s 1.0271 GiB/s 1.2322 GiB/s]
                 change:
                        time:   [-53.459% -42.865% -30.568%] (p = 0.00 < 0.05)
                        thrpt:  [+44.025% +75.024% +114.86%]
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild

jetstream::async_publish_messages_amount/32
                        time:   [104.12 ms 114.29 ms 127.24 ms]
                        thrpt:  [392.95 Kelem/s 437.50 Kelem/s 480.20 Kelem/s]
                 change:
                        time:   [-54.561% -46.081% -35.125%] (p = 0.00 < 0.05)
                        thrpt:  [+54.142% +85.462% +120.07%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking jetstream::async_publish_messages_amount/1024: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.9s or enable flat sampling.
jetstream::async_publish_messages_amount/1024
                        time:   [149.33 ms 174.67 ms 221.41 ms]
                        thrpt:  [225.82 Kelem/s 286.25 Kelem/s 334.83 Kelem/s]
                 change:
                        time:   [-39.572% -16.557% +9.8493%] (p = 0.23 > 0.05)
                        thrpt:  [-8.9662% +19.842% +65.486%]
                        No change in performance detected.
jetstream::async_publish_messages_amount/8192
                        time:   [372.50 ms 475.86 ms 580.71 ms]
                        thrpt:  [86.101 Kelem/s 105.07 Kelem/s 134.23 Kelem/s]
                 change:
                        time:   [-43.316% -27.027% -10.906%] (p = 0.01 < 0.05)
                        thrpt:  [+12.241% +37.038% +76.418%]
                        Performance has improved.

@paolobarbolini paolobarbolini marked this pull request as ready for review August 8, 2023 13:29
@paolobarbolini paolobarbolini force-pushed the connection-into-state-machine branch 2 times, most recently from 5f7c74c to 4acb7c8 Compare August 9, 2023 07:18
@paolobarbolini
Contributor Author

I've added some docs

@paolobarbolini
Contributor Author

paolobarbolini commented Aug 31, 2023

Benchmarks on a Ryzen 5900X server

main (8661572)

Results
nats::publish_throughput/32
                        time:   [279.37 ms 297.73 ms 313.74 ms]
                        thrpt:  [48.635 MiB/s 51.250 MiB/s 54.619 MiB/s]
Benchmarking nats::publish_throughput/1024: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 6.1s.
nats::publish_throughput/1024
                        time:   [541.56 ms 559.82 ms 578.66 ms]
                        thrpt:  [843.82 MiB/s 872.21 MiB/s 901.62 MiB/s]
Benchmarking nats::publish_throughput/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 42.5s.
nats::publish_throughput/8192
                        time:   [3.7260 s 3.8740 s 4.0184 s]
                        thrpt:  [972.10 MiB/s 1008.3 MiB/s 1.0238 GiB/s]

nats::publish_amount/32 time:   [262.30 ms 284.95 ms 308.48 ms]
                        thrpt:  [1.6208 Melem/s 1.7547 Melem/s 1.9062 Melem/s]
Benchmarking nats::publish_amount/1024: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 5.5s.
nats::publish_amount/1024
                        time:   [491.33 ms 520.60 ms 549.44 ms]
                        thrpt:  [910.01 Kelem/s 960.43 Kelem/s 1.0177 Melem/s]
Benchmarking nats::publish_amount/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 41.0s.
nats::publish_amount/8192
                        time:   [3.5932 s 3.7024 s 3.8187 s]
                        thrpt:  [130.94 Kelem/s 135.05 Kelem/s 139.15 Kelem/s]

Benchmarking nats::subscribe_amount/32: Collecting 10 samples in estimated 8.2584 s (20 iterations)
thread 'tokio-runtime-worker' panicked at async-nats/benches/core_nats.rs:105:38:
called `Result::unwrap()` on an `Err` value: PublishError(SendError { .. })
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
nats::subscribe_amount/32
                        time:   [393.24 ms 414.08 ms 434.02 ms]
                        thrpt:  [1.1520 Melem/s 1.2075 Melem/s 1.2715 Melem/s]
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) low mild
Benchmarking nats::subscribe_amount/1024: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 5.5s.
nats::subscribe_amount/1024
                        time:   [505.06 ms 525.38 ms 544.66 ms]
                        thrpt:  [918.00 Kelem/s 951.69 Kelem/s 989.99 Kelem/s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild
Benchmarking nats::subscribe_amount/8192: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 59.8s.
nats::subscribe_amount/8192
                        time:   [5.6543 s 5.7552 s 5.8571 s]
                        thrpt:  [85.367 Kelem/s 86.879 Kelem/s 88.428 Kelem/s]

Benchmarking nats::request_amount/32: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.2s.
nats::request_amount/32 time:   [809.48 ms 824.71 ms 839.17 ms]
                        thrpt:  [11.917 Kelem/s 12.126 Kelem/s 12.354 Kelem/s]
Benchmarking nats::request_amount/1024: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.1s.
nats::request_amount/1024
                        time:   [826.11 ms 849.34 ms 869.14 ms]
                        thrpt:  [11.506 Kelem/s 11.774 Kelem/s 12.105 Kelem/s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild
Benchmarking nats::request_amount/8192: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 10.1s.
nats::request_amount/8192
                        time:   [1.0047 s 1.0288 s 1.0482 s]
                        thrpt:  [9.5400 Kelem/s 9.7199 Kelem/s 9.9527 Kelem/s]
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low severe
  1 (10.00%) low mild

Benchmarking jetstream::sync_publish_throughput/32: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 40.1s.
jetstream::sync_publish_throughput/32
                        time:   [3.9125 s 3.9927 s 4.0625 s]
                        thrpt:  [384.62 KiB/s 391.34 KiB/s 399.36 KiB/s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild
Benchmarking jetstream::sync_publish_throughput/1024: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 40.4s.
jetstream::sync_publish_throughput/1024
                        time:   [4.1443 s 4.1921 s 4.2370 s]
                        thrpt:  [11.524 MiB/s 11.648 MiB/s 11.782 MiB/s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild
Benchmarking jetstream::sync_publish_throughput/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 57.9s.
jetstream::sync_publish_throughput/8192
                        time:   [5.8393 s 5.8768 s 5.9077 s]
                        thrpt:  [66.121 MiB/s 66.469 MiB/s 66.896 MiB/s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild

Benchmarking jetstream sync publish messages amount/32: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 40.1s.
jetstream sync publish messages amount/32
                        time:   [3.9845 s 4.0386 s 4.0901 s]
                        thrpt:  [12.225 Kelem/s 12.381 Kelem/s 12.548 Kelem/s]
Benchmarking jetstream sync publish messages amount/1024: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 42.2s.
jetstream sync publish messages amount/1024
                        time:   [4.1796 s 4.2357 s 4.2810 s]
                        thrpt:  [11.680 Kelem/s 11.804 Kelem/s 11.963 Kelem/s]
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low severe
  1 (10.00%) low mild
Benchmarking jetstream sync publish messages amount/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 57.8s.
jetstream sync publish messages amount/8192
                        time:   [5.8897 s 5.9316 s 5.9680 s]
                        thrpt:  [8.3781 Kelem/s 8.4294 Kelem/s 8.4894 Kelem/s]

jetstream async publish throughput/32
                        time:   [234.66 ms 288.98 ms 346.41 ms]
                        thrpt:  [4.4048 MiB/s 5.2802 MiB/s 6.5025 MiB/s]
jetstream async publish throughput/1024
                        time:   [235.27 ms 283.35 ms 337.76 ms]
                        thrpt:  [144.57 MiB/s 172.33 MiB/s 207.54 MiB/s]
Benchmarking jetstream async publish throughput/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 9.3s.
jetstream async publish throughput/8192
                        time:   [654.44 ms 742.94 ms 831.00 ms]
                        thrpt:  [470.07 MiB/s 525.79 MiB/s 596.89 MiB/s]

jetstream::async_publish_messages_amount/32
                        time:   [251.40 ms 292.01 ms 327.31 ms]
                        thrpt:  [152.76 Kelem/s 171.23 Kelem/s 198.89 Kelem/s]
jetstream::async_publish_messages_amount/1024
                        time:   [302.53 ms 338.62 ms 380.52 ms]
                        thrpt:  [131.40 Kelem/s 147.66 Kelem/s 165.27 Kelem/s]
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild
Benchmarking jetstream::async_publish_messages_amount/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 7.5s.
jetstream::async_publish_messages_amount/8192
                        time:   [700.60 ms 754.29 ms 809.97 ms]
                        thrpt:  [61.730 Kelem/s 66.288 Kelem/s 71.367 Kelem/s]

First part of the PR (5a73aac) against main

This was the original scope of the PR. It allows read, write and flush operations to run concurrently, removes the flush interval and, in most cases, removes the need to flush manually.
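
As a rough illustration of why polling each operation independently allows this, the toy model below uses `std::task::Poll` without a real socket or waker. Everything in it (`ToyConnection`, its fields, `poll_all`) is hypothetical and much simpler than the actual `Connection` and `ConnectionHandler`; it only shows that a write returning `Poll::Pending` does not stop reads from making progress, and that a flush follows once writes succeed.

```rust
use std::task::Poll;

/// Toy connection; all names and fields are hypothetical.
struct ToyConnection {
    inbound: u32,        // messages waiting to be read
    outbound: u32,       // messages waiting to be written
    write_blocked: bool, // simulates a full socket send buffer
    dirty: bool,         // written data that has not been flushed yet
    delivered: u32,      // messages read so far
    flushes: u32,        // completed flushes
}

impl ToyConnection {
    fn poll_read(&mut self) -> Poll<()> {
        if self.inbound == 0 {
            return Poll::Pending;
        }
        self.inbound -= 1;
        self.delivered += 1;
        Poll::Ready(())
    }

    fn poll_write(&mut self) -> Poll<()> {
        if self.outbound == 0 || self.write_blocked {
            return Poll::Pending;
        }
        self.outbound -= 1;
        self.dirty = true;
        Poll::Ready(())
    }

    fn poll_flush(&mut self) -> Poll<()> {
        if self.write_blocked {
            return Poll::Pending;
        }
        self.dirty = false;
        self.flushes += 1;
        Poll::Ready(())
    }

    /// One pass of the handler: every operation is polled on its own,
    /// so a pending write or flush never prevents reads.
    fn poll_all(&mut self) -> Poll<()> {
        let mut progressed = false;
        while self.poll_read().is_ready() {
            progressed = true;
        }
        while self.poll_write().is_ready() {
            progressed = true;
        }
        // Flush only when something was written since the last flush.
        if self.dirty && self.poll_flush().is_ready() {
            progressed = true;
        }
        if progressed { Poll::Ready(()) } else { Poll::Pending }
    }
}
```

A real implementation would take a `Context` and register wakers instead of returning a bare `Poll`, but the control flow is the same: no operation awaits another, so none can block the others.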

Results
nats::publish_throughput/32
                        time:   [179.10 ms 192.58 ms 204.84 ms]
                        thrpt:  [74.493 MiB/s 79.232 MiB/s 85.195 MiB/s]
                 change:
                        time:   [-41.195% -35.317% -29.200%] (p = 0.00 < 0.05)
                        thrpt:  [+41.244% +54.600% +70.054%]
                        Performance has improved.
nats::publish_throughput/1024
                        time:   [403.75 ms 422.94 ms 448.20 ms]
                        thrpt:  [1.0639 GiB/s 1.1274 GiB/s 1.1810 GiB/s]
                 change:
                        time:   [-28.817% -24.450% -19.198%] (p = 0.00 < 0.05)
                        thrpt:  [+23.760% +32.363% +40.483%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking nats::publish_throughput/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 36.5s.
nats::publish_throughput/8192
                        time:   [3.3836 s 3.4445 s 3.5087 s]
                        thrpt:  [1.0872 GiB/s 1.1075 GiB/s 1.1274 GiB/s]
                 change:
                        time:   [-14.561% -11.087% -7.1463%] (p = 0.00 < 0.05)
                        thrpt:  [+7.6963% +12.469% +17.043%]
                        Performance has improved.

nats::publish_amount/32 time:   [189.48 ms 202.11 ms 211.73 ms]
                        thrpt:  [2.3615 Melem/s 2.4739 Melem/s 2.6388 Melem/s]
                 change:
                        time:   [-36.034% -29.071% -21.665%] (p = 0.00 < 0.05)
                        thrpt:  [+27.657% +40.987% +56.333%]
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low severe
  1 (10.00%) low mild
nats::publish_amount/1024
                        time:   [388.18 ms 401.24 ms 416.89 ms]
                        thrpt:  [1.1994 Melem/s 1.2462 Melem/s 1.2881 Melem/s]
                 change:
                        time:   [-27.640% -22.928% -17.446%] (p = 0.00 < 0.05)
                        thrpt:  [+21.133% +29.749% +38.199%]
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild
Benchmarking nats::publish_amount/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 35.4s.
nats::publish_amount/8192
                        time:   [3.3951 s 3.4861 s 3.5708 s]
                        thrpt:  [140.02 Kelem/s 143.43 Kelem/s 147.27 Kelem/s]
                 change:
                        time:   [-9.5212% -5.8424% -2.1558%] (p = 0.01 < 0.05)
                        thrpt:  [+2.2033% +6.2049% +10.523%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild

nats::subscribe_amount/32
                        time:   [297.79 ms 304.72 ms 312.84 ms]
                        thrpt:  [1.5983 Melem/s 1.6409 Melem/s 1.6790 Melem/s]
                 change:
                        time:   [-30.307% -26.410% -22.120%] (p = 0.00 < 0.05)
                        thrpt:  [+28.402% +35.888% +43.487%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
Benchmarking nats::subscribe_amount/1024: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 5.3s.
nats::subscribe_amount/1024
                        time:   [460.27 ms 496.72 ms 529.78 ms]
                        thrpt:  [943.79 Kelem/s 1.0066 Melem/s 1.0863 Melem/s]
                 change:
                        time:   [-12.146% -5.4558% +2.3860%] (p = 0.21 > 0.05)
                        thrpt:  [-2.3304% +5.7707% +13.825%]
                        No change in performance detected.
Benchmarking nats::subscribe_amount/8192: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 39.7s.
nats::subscribe_amount/8192
                        time:   [4.5243 s 5.0122 s 5.4192 s]
                        thrpt:  [92.264 Kelem/s 99.756 Kelem/s 110.51 Kelem/s]
                 change:
                        time:   [-21.637% -12.909% -6.3865%] (p = 0.00 < 0.05)
                        thrpt:  [+6.8223% +14.822% +27.611%]
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) low mild

Benchmarking nats::request_amount/32: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.1s.
nats::request_amount/32 time:   [571.27 ms 673.26 ms 771.24 ms]
                        thrpt:  [12.966 Kelem/s 14.853 Kelem/s 17.505 Kelem/s]
                 change:
                        time:   [-30.895% -18.364% -6.6358%] (p = 0.01 < 0.05)
                        thrpt:  [+7.1075% +22.495% +44.707%]
                        Performance has improved.
Benchmarking nats::request_amount/1024: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.6s.
nats::request_amount/1024
                        time:   [810.40 ms 831.63 ms 850.35 ms]
                        thrpt:  [11.760 Kelem/s 12.025 Kelem/s 12.340 Kelem/s]
                 change:
                        time:   [-5.5714% -2.0854% +1.2819%] (p = 0.29 > 0.05)
                        thrpt:  [-1.2657% +2.1299% +5.9001%]
                        No change in performance detected.
Benchmarking nats::request_amount/8192: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 10.1s.
nats::request_amount/8192
                        time:   [1.0073 s 1.0170 s 1.0259 s]
                        thrpt:  [9.7479 Kelem/s 9.8333 Kelem/s 9.9279 Kelem/s]
                 change:
                        time:   [-3.1825% -1.1533% +1.3898%] (p = 0.38 > 0.05)
                        thrpt:  [-1.3707% +1.1668% +3.2871%]
                        No change in performance detected.

Benchmarking jetstream::sync_publish_throughput/32: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 35.7s.
jetstream::sync_publish_throughput/32
                        time:   [3.8331 s 3.9169 s 4.0004 s]
                        thrpt:  [390.58 KiB/s 398.92 KiB/s 407.63 KiB/s]
                 change:
                        time:   [-4.6425% -1.8992% +0.9463%] (p = 0.22 > 0.05)
                        thrpt:  [-0.9374% +1.9359% +4.8685%]
                        No change in performance detected.
Benchmarking jetstream::sync_publish_throughput/1024: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 40.3s.
jetstream::sync_publish_throughput/1024
                        time:   [4.0426 s 4.1048 s 4.1671 s]
                        thrpt:  [11.718 MiB/s 11.895 MiB/s 12.078 MiB/s]
                 change:
                        time:   [-3.8321% -2.0835% -0.2738%] (p = 0.05 > 0.05)
                        thrpt:  [+0.2745% +2.1278% +3.9848%]
                        No change in performance detected.
Benchmarking jetstream::sync_publish_throughput/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 56.9s.
jetstream::sync_publish_throughput/8192
                        time:   [5.8264 s 5.8747 s 5.9109 s]
                        thrpt:  [66.086 MiB/s 66.493 MiB/s 67.044 MiB/s]
                 change:
                        time:   [-0.9895% -0.0361% +0.8900%] (p = 0.95 > 0.05)
                        thrpt:  [-0.8821% +0.0361% +0.9994%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low severe

Benchmarking jetstream sync publish messages amount/32: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 39.7s.
jetstream sync publish messages amount/32
                        time:   [3.7907 s 3.8869 s 3.9735 s]
                        thrpt:  [12.583 Kelem/s 12.864 Kelem/s 13.190 Kelem/s]
                 change:
                        time:   [-6.4282% -3.7565% -1.3943%] (p = 0.01 < 0.05)
                        thrpt:  [+1.4140% +3.9031% +6.8698%]
                        Performance has improved.
Benchmarking jetstream sync publish messages amount/1024: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 41.4s.
jetstream sync publish messages amount/1024
                        time:   [4.1361 s 4.2033 s 4.2631 s]
                        thrpt:  [11.728 Kelem/s 11.895 Kelem/s 12.089 Kelem/s]
                 change:
                        time:   [-2.3992% -0.7660% +1.1244%] (p = 0.48 > 0.05)
                        thrpt:  [-1.1119% +0.7719% +2.4582%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild
Benchmarking jetstream sync publish messages amount/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 57.8s.
jetstream sync publish messages amount/8192
                        time:   [5.9014 s 5.9465 s 5.9774 s]
                        thrpt:  [8.3649 Kelem/s 8.4083 Kelem/s 8.4725 Kelem/s]
                 change:
                        time:   [-0.6851% +0.2504% +1.1608%] (p = 0.64 > 0.05)
                        thrpt:  [-1.1475% -0.2498% +0.6898%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low severe

jetstream async publish throughput/32
                        time:   [176.22 ms 202.89 ms 230.19 ms]
                        thrpt:  [6.6288 MiB/s 7.5206 MiB/s 8.6590 MiB/s]
                 change:
                        time:   [-43.664% -29.790% -10.814%] (p = 0.02 < 0.05)
                        thrpt:  [+12.125% +42.430% +77.506%]
                        Performance has improved.
Benchmarking jetstream async publish throughput/1024: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 9.7s or enable flat sampling.
jetstream async publish throughput/1024
                        time:   [267.21 ms 304.52 ms 334.80 ms]
                        thrpt:  [145.84 MiB/s 160.35 MiB/s 182.73 MiB/s]
                 change:
                        time:   [-13.532% +8.8176% +36.756%] (p = 0.50 > 0.05)
                        thrpt:  [-26.877% -8.1031% +15.650%]
                        No change in performance detected.
Benchmarking jetstream async publish throughput/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 5.5s.
jetstream async publish throughput/8192
                        time:   [559.21 ms 661.97 ms 765.84 ms]
                        thrpt:  [510.06 MiB/s 590.10 MiB/s 698.52 MiB/s]
                 change:
                        time:   [-27.185% -10.898% +7.5984%] (p = 0.28 > 0.05)
                        thrpt:  [-7.0618% +12.231% +37.334%]
                        No change in performance detected.

jetstream::async_publish_messages_amount/32
                        time:   [198.43 ms 226.47 ms 254.69 ms]
                        thrpt:  [196.32 Kelem/s 220.78 Kelem/s 251.98 Kelem/s]
                 change:
                        time:   [-34.279% -22.445% -6.1193%] (p = 0.02 < 0.05)
                        thrpt:  [+6.5181% +28.940% +52.159%]
                        Performance has improved.
jetstream::async_publish_messages_amount/1024
                        time:   [277.26 ms 307.25 ms 342.50 ms]
                        thrpt:  [145.98 Kelem/s 162.73 Kelem/s 180.34 Kelem/s]
                 change:
                        time:   [-22.326% -9.2648% +6.6219%] (p = 0.27 > 0.05)
                        thrpt:  [-6.2106% +10.211% +28.744%]
                        No change in performance detected.
Benchmarking jetstream::async_publish_messages_amount/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 6.2s.
jetstream::async_publish_messages_amount/8192
                        time:   [525.45 ms 571.20 ms 620.19 ms]
                        thrpt:  [80.620 Kelem/s 87.535 Kelem/s 95.157 Kelem/s]
                 change:
                        time:   [-32.263% -24.273% -15.591%] (p = 0.00 < 0.05)
                        thrpt:  [+18.470% +32.053% +47.629%]
                        Performance has improved.

Full PR (9e0be06) against the first step

These are the three extra commits I added at the end to further optimize writes. Small writes get flattened into a shared buffer, while big ones bypass all buffering (at least on our side). When the connection supports it, we do vectored writes. All of this allowed us to remove BufWriter.
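
To illustrate the idea, here is a minimal sketch of flattening small writes and flushing with a vectored write. It is not the code from this PR: `WriteQueue`, `SMALL_WRITE`, `push` and `write_to` are hypothetical names, the 4096-byte threshold is an assumption, and it uses the blocking `std::io::Write` trait rather than the poll-based async writer the PR works with.

```rust
use std::collections::VecDeque;
use std::io::{self, IoSlice, Write};

/// Hypothetical threshold: payloads smaller than this are coalesced.
const SMALL_WRITE: usize = 4096;

/// Hypothetical write queue, not the actual async-nats type.
#[derive(Default)]
struct WriteQueue {
    /// Pending chunks, oldest first.
    chunks: VecDeque<Vec<u8>>,
    /// True when the last chunk is a coalescing buffer that may still grow.
    tail_is_flat: bool,
}

impl WriteQueue {
    /// Queue `data`, flattening small payloads into a shared tail chunk.
    fn push(&mut self, data: &[u8]) {
        if data.is_empty() {
            return;
        }
        if data.len() >= SMALL_WRITE {
            // Large payload: keep it as its own chunk so it never goes
            // through the coalescing buffer. (This sketch still copies it
            // once; a real implementation would likely store a shared,
            // reference-counted buffer instead.)
            self.chunks.push_back(data.to_vec());
            self.tail_is_flat = false;
            return;
        }
        if self.tail_is_flat {
            if let Some(tail) = self.chunks.back_mut() {
                if tail.len() + data.len() <= SMALL_WRITE {
                    // Small payload that still fits: append to the tail.
                    tail.extend_from_slice(data);
                    return;
                }
            }
        }
        // Start a new coalescing buffer.
        let mut buf = Vec::with_capacity(SMALL_WRITE);
        buf.extend_from_slice(data);
        self.chunks.push_back(buf);
        self.tail_is_flat = true;
    }

    /// Write all queued chunks, handing them to the writer as one
    /// vectored write per iteration.
    fn write_to<W: Write>(&mut self, w: &mut W) -> io::Result<()> {
        while !self.chunks.is_empty() {
            let mut written = {
                let slices: Vec<IoSlice<'_>> =
                    self.chunks.iter().map(|c| IoSlice::new(c)).collect();
                w.write_vectored(&slices)?
            };
            if written == 0 {
                return Err(io::ErrorKind::WriteZero.into());
            }
            // Drop fully written chunks and trim a partially written one.
            while written > 0 {
                let front_len = self.chunks[0].len();
                if written >= front_len {
                    self.chunks.pop_front();
                    written -= front_len;
                } else {
                    self.chunks[0].drain(..written);
                    written = 0;
                }
            }
        }
        self.tail_is_flat = false;
        Ok(())
    }
}
```

Pushing two small protocol lines followed by one large payload leaves two chunks in the queue (one flattened, one standalone), and a single `write_to` call on a writer that supports vectored I/O can hand both to the OS at once. Writers without real vectored support fall back to writing the first non-empty slice, which the loop handles as a partial write.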

Results
nats::publish_throughput/32
                        time:   [125.83 ms 148.50 ms 167.17 ms]
                        thrpt:  [91.278 MiB/s 102.75 MiB/s 121.26 MiB/s]
                 change:
                        time:   [-36.522% -26.921% -14.196%] (p = 0.00 < 0.05)
                        thrpt:  [+16.545% +36.839% +57.535%]
                        Performance has improved.
nats::publish_throughput/1024
                        time:   [221.91 ms 234.66 ms 244.72 ms]
                        thrpt:  [1.9485 GiB/s 2.0320 GiB/s 2.1488 GiB/s]
                 change:
                        time:   [-48.599% -44.517% -40.786%] (p = 0.00 < 0.05)
                        thrpt:  [+68.880% +80.235% +94.547%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild
Benchmarking nats::publish_throughput/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 12.3s.
nats::publish_throughput/8192
                        time:   [1.3366 s 1.3827 s 1.4257 s]
                        thrpt:  [2.6756 GiB/s 2.7589 GiB/s 2.8541 GiB/s]
                 change:
                        time:   [-61.355% -59.858% -58.453%] (p = 0.00 < 0.05)
                        thrpt:  [+140.69% +149.11% +158.77%]
                        Performance has improved.

Benchmarking nats::publish_amount/32: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 9.7s or enable flat sampling.
nats::publish_amount/32 time:   [138.35 ms 161.47 ms 177.04 ms]
                        thrpt:  [2.8243 Melem/s 3.0965 Melem/s 3.6139 Melem/s]
                 change:
                        time:   [-24.646% -17.041% -8.8368%] (p = 0.00 < 0.05)
                        thrpt:  [+9.6934% +20.542% +32.707%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild
nats::publish_amount/1024
                        time:   [225.59 ms 236.15 ms 244.57 ms]
                        thrpt:  [2.0444 Melem/s 2.1173 Melem/s 2.2164 Melem/s]
                 change:
                        time:   [-44.255% -41.143% -38.165%] (p = 0.00 < 0.05)
                        thrpt:  [+61.722% +69.904% +79.389%]
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) low severe
Benchmarking nats::publish_amount/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 13.8s.
nats::publish_amount/8192
                        time:   [1.2914 s 1.3426 s 1.3932 s]
                        thrpt:  [358.87 Kelem/s 372.42 Kelem/s 387.16 Kelem/s]
                 change:
                        time:   [-63.301% -61.488% -59.751%] (p = 0.00 < 0.05)
                        thrpt:  [+148.45% +159.66% +172.49%]
                        Performance has improved.

Benchmarking nats::subscribe_amount/32: Collecting 10 samples in estimated 6.5198 s (20 iterations)
thread 'tokio-runtime-worker' panicked at async-nats/benches/core_nats.rs:105:38:
called `Result::unwrap()` on an `Err` value: PublishError(SendError { .. })
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
nats::subscribe_amount/32
                        time:   [289.57 ms 304.66 ms 320.14 ms]
                        thrpt:  [1.5618 Melem/s 1.6412 Melem/s 1.7267 Melem/s]
                 change:
                        time:   [-5.8643% -0.0208% +5.5422%] (p = 1.00 > 0.05)
                        thrpt:  [-5.2512% +0.0208% +6.2296%]
                        No change in performance detected.
nats::subscribe_amount/1024
                        time:   [371.81 ms 382.83 ms 393.31 ms]
                        thrpt:  [1.2713 Melem/s 1.3061 Melem/s 1.3448 Melem/s]
                 change:
                        time:   [-28.260% -22.927% -16.488%] (p = 0.00 < 0.05)
                        thrpt:  [+19.743% +29.748% +39.392%]
                        Performance has improved.
Benchmarking nats::subscribe_amount/8192: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 16.1s.
nats::subscribe_amount/8192
                        time:   [1.5561 s 1.5920 s 1.6166 s]
                        thrpt:  [309.30 Kelem/s 314.07 Kelem/s 321.33 Kelem/s]
                 change:
                        time:   [-70.705% -68.238% -64.749%] (p = 0.00 < 0.05)
                        thrpt:  [+183.68% +214.84% +241.36%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low severe

Benchmarking nats::request_amount/32: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 7.7s.
nats::request_amount/32 time:   [736.09 ms 770.38 ms 805.84 ms]
                        thrpt:  [12.409 Kelem/s 12.981 Kelem/s 13.585 Kelem/s]
                 change:
                        time:   [-0.9874% +14.426% +35.611%] (p = 0.11 > 0.05)
                        thrpt:  [-26.260% -12.607% +0.9972%]
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high mild
Benchmarking nats::request_amount/1024: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 7.2s.
nats::request_amount/1024
                        time:   [773.74 ms 800.45 ms 829.27 ms]
                        thrpt:  [12.059 Kelem/s 12.493 Kelem/s 12.924 Kelem/s]
                 change:
                        time:   [-7.3817% -3.7485% +0.4523%] (p = 0.11 > 0.05)
                        thrpt:  [-0.4503% +3.8945% +7.9700%]
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high mild
Benchmarking nats::request_amount/8192: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.2s.
nats::request_amount/8192
                        time:   [821.09 ms 838.74 ms 858.39 ms]
                        thrpt:  [11.650 Kelem/s 11.923 Kelem/s 12.179 Kelem/s]
                 change:
                        time:   [-19.397% -17.524% -15.553%] (p = 0.00 < 0.05)
                        thrpt:  [+18.418% +21.248% +24.065%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Benchmarking jetstream::sync_publish_throughput/32: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 38.8s.
jetstream::sync_publish_throughput/32
                        time:   [3.7498 s 3.8475 s 3.9401 s]
                        thrpt:  [396.56 KiB/s 406.11 KiB/s 416.69 KiB/s]
                 change:
                        time:   [-4.7532% -1.7710% +1.2782%] (p = 0.32 > 0.05)
                        thrpt:  [-1.2620% +1.8029% +4.9904%]
                        No change in performance detected.
Benchmarking jetstream::sync_publish_throughput/1024: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 40.7s.
jetstream::sync_publish_throughput/1024
                        time:   [4.1059 s 4.1283 s 4.1503 s]
                        thrpt:  [11.765 MiB/s 11.828 MiB/s 11.892 MiB/s]
                 change:
                        time:   [-1.0258% +0.5737% +2.2201%] (p = 0.52 > 0.05)
                        thrpt:  [-2.1719% -0.5704% +1.0364%]
                        No change in performance detected.
Benchmarking jetstream::sync_publish_throughput/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 48.5s.
jetstream::sync_publish_throughput/8192
                        time:   [4.9253 s 4.9878 s 5.0443 s]
                        thrpt:  [77.439 MiB/s 78.317 MiB/s 79.310 MiB/s]
                 change:
                        time:   [-16.206% -15.097% -13.982%] (p = 0.00 < 0.05)
                        thrpt:  [+16.254% +17.782% +19.341%]
                        Performance has improved.

Benchmarking jetstream sync publish messages amount/32: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 37.4s.
jetstream sync publish messages amount/32
                        time:   [3.8466 s 3.8992 s 3.9534 s]
                        thrpt:  [12.647 Kelem/s 12.823 Kelem/s 12.998 Kelem/s]
                 change:
                        time:   [-2.2477% +0.3165% +3.2240%] (p = 0.83 > 0.05)
                        thrpt:  [-3.1233% -0.3155% +2.2993%]
                        No change in performance detected.
Benchmarking jetstream sync publish messages amount/1024: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 37.1s.
jetstream sync publish messages amount/1024
                        time:   [4.1068 s 4.1717 s 4.2296 s]
                        thrpt:  [11.821 Kelem/s 11.986 Kelem/s 12.175 Kelem/s]
                 change:
                        time:   [-2.8102% -0.7515% +1.3158%] (p = 0.51 > 0.05)
                        thrpt:  [-1.2987% +0.7572% +2.8915%]
                        No change in performance detected.
Benchmarking jetstream sync publish messages amount/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 49.9s.
jetstream sync publish messages amount/8192
                        time:   [4.9066 s 4.9607 s 5.0193 s]
                        thrpt:  [9.9615 Kelem/s 10.079 Kelem/s 10.190 Kelem/s]
                 change:
                        time:   [-17.673% -16.578% -15.460%] (p = 0.00 < 0.05)
                        thrpt:  [+18.288% +19.873% +21.466%]
                        Performance has improved.

jetstream async publish throughput/32
                        time:   [151.12 ms 167.54 ms 185.15 ms]
                        thrpt:  [8.2414 MiB/s 9.1075 MiB/s 10.097 MiB/s]
                 change:
                        time:   [-29.757% -17.424% -1.7062%] (p = 0.05 > 0.05)
                        thrpt:  [+1.7358% +21.101% +42.363%]
                        No change in performance detected.
Benchmarking jetstream async publish throughput/1024: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.0s or enable flat sampling.
jetstream async publish throughput/1024
                        time:   [245.90 ms 303.26 ms 329.56 ms]
                        thrpt:  [148.16 MiB/s 161.01 MiB/s 198.57 MiB/s]
                 change:
                        time:   [-42.790% -23.917% -0.8215%] (p = 0.07 > 0.05)
                        thrpt:  [+0.8283% +31.436% +74.795%]
                        No change in performance detected.
Benchmarking jetstream async publish throughput/8192: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 6.4s.
jetstream async publish throughput/8192
                        time:   [291.90 ms 311.93 ms 329.81 ms]
                        thrpt:  [1.1567 GiB/s 1.2230 GiB/s 1.3068 GiB/s]
                 change:
                        time:   [-59.728% -52.879% -43.652%] (p = 0.00 < 0.05)
                        thrpt:  [+77.469% +112.22% +148.31%]
                        Performance has improved.

Benchmarking jetstream::async_publish_messages_amount/32: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 9.4s or enable flat sampling.
jetstream::async_publish_messages_amount/32
                        time:   [209.93 ms 234.97 ms 251.19 ms]
                        thrpt:  [199.05 Kelem/s 212.79 Kelem/s 238.18 Kelem/s]
                 change:
                        time:   [-21.216% -7.8455% +8.8059%] (p = 0.37 > 0.05)
                        thrpt:  [-8.0933% +8.5134% +26.929%]
                        No change in performance detected.
jetstream::async_publish_messages_amount/1024
                        time:   [154.71 ms 195.91 ms 236.58 ms]
                        thrpt:  [211.34 Kelem/s 255.21 Kelem/s 323.18 Kelem/s]
                 change:
                        time:   [-51.072% -36.237% -20.586%] (p = 0.00 < 0.05)
                        thrpt:  [+25.922% +56.830% +104.38%]
                        Performance has improved.
jetstream::async_publish_messages_amount/8192
                        time:   [297.34 ms 321.73 ms 345.42 ms]
                        thrpt:  [144.75 Kelem/s 155.41 Kelem/s 168.16 Kelem/s]
                 change:
                        time:   [-49.597% -43.674% -36.992%] (p = 0.00 < 0.05)
                        thrpt:  [+58.710% +77.539% +98.399%]
                        Performance has improved.

Collaborator

@caspervonb caspervonb left a comment

Overall, I'm liking it. Going to avoid nitpicking as this is rather huge.
There are some unwraps in the write macros, but since this is coming from internal code, I suppose it's fine.

async-nats/src/connection.rs (resolved)
async-nats/src/lib.rs (resolved)
Collaborator

@caspervonb caspervonb left a comment

Alright, did a final read-through, going to be an approval from me 🎉

LGTM!

Member

@Jarema Jarema left a comment

I ran a few more benchmarks on Linux, with solid results.
That concludes the work on this PR.

LGTM!
Great effort, thanks for that!

Sorry you had to wait quite a bit.

@paolobarbolini
Contributor Author

🥳. I've fixed the merge conflict

@Jarema
Member

Jarema commented Sep 19, 2023

Can you rebase the branch so commit messages are consistent with others in the repo? (Capitalize first letter, simple imperative mood) @paolobarbolini

@Jarema Jarema merged commit fe79b4d into nats-io:main Sep 19, 2023
13 checks passed
@Jarema Jarema mentioned this pull request Sep 21, 2023
Development

Successfully merging this pull request may close these issues.

Constant CPU usage with axum/tower & ConnectOptions Inefficient networking Add flush parameter to publish
4 participants